Data Science Design Pattern for Student Drop Out

Microsoft

Introduction

Welcome to the Data Science Design Pattern for Student Drop Out. This pattern provides a starting point for the data scientist exploring a new dataset. By no means is it the end point of the data science journey. The pattern is under regular revision and improvement and is provided as is.

We now introduce a generic pattern for building multiple binary classification models using R.

Pre-configuration

We load the R packages required for modelling.

########################################################################
# R SETUP

# Load required packages from local library into R.

library(magrittr)     # Data pipelines: %>% %T>% %<>%.
library(stringi)      # String operator: %s+%.
library(rattle)       # Evaluate using riskchart().
library(unbalanced)   # Resampling using ubSMOTE.
library(rpart)        # Model: decision tree.
library(rpart.plot)   # Draw fancyRpartPlot().
library(randomForest) # Model: random forest.
library(ada)          # Model: ada boosting.
library(party)        # Model: ctree and cforest.
library(e1071)        # Model: support vector machine.
library(nnet)         # Model: neural network.
library(Matrix)       # Construct a Matrix of a class that inherits from Matrix.
library(caret)        # Tune model hyper-parameters.
library(xgboost)      # Model: extreme gradient boosting.
library(Ckmeans.1d.dp) # Plot feature importance using xgb.plot.importance.
library(DiagrammeR)   # Plot xgboost tree using xgb.plot.tree.
library(ROCR)         # Use prediction() for evaluation.
library(pROC)         # Use auc() for evaluation. 
library(ggplot2)      # Visually evaluate performance.

Step 4.4: Re-load Dataset

In the Data template we loaded the studentDropIndia dataset, processed it, and saved it to file. Here we re-load the dataset and review its contents. In addition, we define some support functions for evaluation.

########################################################################
# DATA INGESTION

# Identify the dataset.

dsname <- "studentDropIndia"

# We define some support functions that we often find useful.

evaluateModel <- function(data, observed, predicted) 
{ 
  # Calculate the confusion matrix
  
  confusion <- table(data[[observed]], data[[predicted]], dnn=c("Observed", "Predicted"))
  confusion %>% print()
  
  # Calculate the performance metrics
  
  tp <- confusion[rownames(confusion) == 1, colnames(confusion) == 1]
  fn <- confusion[rownames(confusion) == 1, colnames(confusion) == 0]
  fp <- confusion[rownames(confusion) == 0, colnames(confusion) == 1]
  tn <- confusion[rownames(confusion) == 0, colnames(confusion) == 0]
  
  accuracy <- (tp + tn) / (tp + fn + fp + tn)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  fscore <- 2 * (precision * recall) / (precision + recall)
  
  # Construct the vector of performance metrics
  
  metrics <- c("Accuracy" = accuracy,
               "Precision" = precision,
               "Recall" = recall,
               "F-Score" = fscore)
  
  # Return the vector of performance metrics
  
  return(metrics)
}
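
To make the arithmetic inside evaluateModel() concrete, here is a self-contained toy example (with made-up 0/1 labels, purely for illustration) performing the same confusion-matrix calculation in base R:

```r
# Hypothetical observed and predicted labels.
obs <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
prd <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0)

confusion <- table(obs, prd, dnn=c("Observed", "Predicted"))

tp <- confusion["1", "1"]   # 3 true positives
fn <- confusion["1", "0"]   # 1 false negative
fp <- confusion["0", "1"]   # 1 false positive
tn <- confusion["0", "0"]   # 5 true negatives

c(Accuracy  = (tp + tn) / sum(confusion),
  Precision = tp / (tp + fp),
  Recall    = tp / (tp + fn))
```

Accuracy here is 0.8 with precision and recall both 0.75, matching the formulas used in evaluateModel().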

rocChart <- function(pr, target)
{
  # Calculate the true positive and the false positive rates.
  
  rates <- pr %>%
    prediction(target) %>%
    performance("tpr", "fpr")
  
  # Calculate the AUC.
  
  auc <- pr %>%
    prediction(target) %>%
    performance("auc") %>%
    attr("y.values") %>%
    extract2(1)
  
  # Construct the plot.
  
  pl <- data.frame(tpr=attr(rates, "y.values")[[1]], 
                   fpr=attr(rates, "x.values")[[1]]) %>%
    ggplot(aes(fpr, tpr)) +
    geom_line() +
    annotate("text", x=0.875, y=0.125, vjust=0,
             label=paste("AUC =", round(100*auc, 2)), 
             family="xkcd") +
    xlab("False Positive Rate (1-Specificity)") +
    ylab("True Positive Rate (Sensitivity)")
  
  # Return the plot object.
  
  return(pl)
}
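
As a dependency-free cross-check of the AUC that prediction() and performance() report, the AUC can also be computed directly from its rank-statistic definition: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counting half. A minimal sketch, using made-up scores:

```r
# Manual AUC via the Mann-Whitney identity (illustrative only).
auc_manual <- function(pr, target)
{
  pos <- pr[target == 1]
  neg <- pr[target == 0]
  # Compare every positive score against every negative score.
  cmp <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(cmp)
}

auc_manual(pr=c(0.9, 0.4, 0.35, 0.8), target=c(1, 1, 0, 0))
# 0.75 (three of the four positive/negative pairs are ranked correctly)
```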

# Identify the dataset to load.

fpath  <- "data"
dsdate <- "_" %s+% "20161215"

# Filename of the saved dataset.

dsrdata <-
  file.path(fpath, dsname %s+% dsdate %s+% ".RData") %T>% 
  print()
## [1] "data/studentDropIndia_20161215.RData"
# Load the R objects from file and list them.

load(dsrdata) %>% print()
##  [1] "ds"     "dsname" "dspath" "dsdate" "nobs"   "vars"   "target"
##  [8] "id"     "ignore" "omit"   "inputi" "inputs" "numi"   "numc"  
## [15] "cati"   "catc"
# Review the metadata.

dsname
## [1] "studentDropIndia"
dspath
## [1] "C:/Demo/EducationAnalytics/Data/studentDropIndia_20161215.csv"
dsdate
## [1] "_20161215"
nobs
## [1] 19100
vars
##  [1] "continue_drop"      "gender"             "caste"             
##  [4] "mathematics_marks"  "english_marks"      "science_marks"     
##  [7] "science_teacher"    "languages_teacher"  "guardian"          
## [10] "internet"           "total_students"     "total_toilets"     
## [13] "establishment_year"
target
## [1] "continue_drop"
id
## [1] "student_id" "school_id"
ignore
## [1] "student_id" "school_id"
omit
## NULL

Step 4.5: Prepare - Formula to Describe the Goal

We continue from the Data module, which covered Steps 1, 2, and 3 and the beginning of Step 4 of the data mining process.

The next step is to describe the model to be built by writing a formula that captures our intent. The formula states that the model is constructed to predict the target variable from the other (suitable) variables available in the dataset. The notation names the target (continue_drop), followed by a tilde (~), followed by a period (.) representing all other variables (in our case, those listed in vars).

########################################################################
# PREPARE FOR MODELLING

# Formula for modelling.

form <- ds[vars] %>% formula() %T>% print()
## continue_drop ~ gender + caste + mathematics_marks + english_marks + 
##     science_marks + science_teacher + languages_teacher + guardian + 
##     internet + total_students + total_toilets + establishment_year
## <environment: 0x00000000268ae848>

A common methodology for model building is to randomly partition the available data into a training dataset and a testing dataset. We sometimes also introduce a third dataset, called the validation dataset, used during the building of the model, but for now we will use just the two.

First we (optionally) initialise the random number generator with a known seed, recording the seed so that the experiments presented here can be repeated if required. For consistency in this module we use a fixed seed of 123.

Next we partition the dataset into two subsets. The first is a 70% random sample for building the model (the training dataset) and the second is the remainder, used to evaluate the performance of the model (the testing dataset).

# Initialise random numbers for repeatable results.

seed <- 123
set.seed(seed)

# Partition the full dataset into two.

train <- 
  sample(nobs, 0.70*nobs) %T>% 
  {length(.) %>% print()}
## [1] 13370
head(train)
## [1]  5493 15056  7811 16863 17960   870
test <- 
  seq_len(nobs) %>%
  setdiff(train) %T>%
  {length(.) %>% print()}
## [1] 5730
head(test)
## [1]  2  3  6 15 16 17

Step 5: Resampling - Rebalancing the Minority to Majority Class Proportion (Optional)

Since the minority class (students dropping out) makes up only around 5% of the whole dataset, we apply SMOTE to the training dataset using the function ubSMOTE from the R package "unbalanced". This yields a drop-out proportion of about 23% in the resampled training data. Using the SMOTE-balanced training dataset as the modelling input can greatly improve model performance, especially for algorithms that are not well suited to unbalanced data.

# Rebalance the training dataset.

traindata <- as.data.frame(ds[train, inputs])
traintarget <- as.factor(as.numeric(as.data.frame(ds[train, target])[[1]])-1)

smote <- ubSMOTE(X=traindata, Y=traintarget,
                 perc.over=200, perc.under=500,
                 k=3, verbose=TRUE) 

trainsmote <- cbind(smote$X, smote$Y)
names(trainsmote)[names(trainsmote) == "smote$Y"] <- "continue_drop"

traindata <- trainsmote

# Check the dropping-out proportion

table(traindata$continue_drop)/nrow(traindata)
## 
##         0         1 
## 0.7692308 0.2307692
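
The 23% proportion follows from the ubSMOTE() parameters: perc.over=200 creates two synthetic minority cases for each original minority case, and perc.under=500 samples five majority cases for each synthetic case. A sketch of the arithmetic, assuming these documented semantics:

```r
m <- 1                   # one unit of original minority (drop-out) cases
minority <- m + 2 * m    # originals plus perc.over/100 = 2 synthetics each
majority <- 5 * (2 * m)  # perc.under/100 = 5 majority per synthetic case

minority / (minority + majority)
# 0.2307692, i.e. 3/13, matching the proportion reported above
```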

Step 6.1: Build - Decision Tree Model

Commonly used classification model builders include rpart() decision trees, randomForest() random forests, ada() stochastic boosting, etc. We begin with an rpart() decision tree as a baseline model. Note that, for the purpose of demonstration, the models from here on are all built on the original (unresampled) training dataset.

# Train model: rpart

ctrl <- rpart.control(maxdepth=3)
system.time(m.rp <- rpart(form, ds[train, vars], control=ctrl))
##    user  system elapsed 
##    0.72    0.02    0.20
m.rp
## n= 13370 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 13370 645 continue (0.95175767 0.04824233)  
##    2) english_marks< 0.9975 13334 609 continue (0.95432728 0.04567272)  
##      4) science_teacher< 5.5 9625 244 continue (0.97464935 0.02535065) *
##      5) science_teacher>=5.5 3709 365 continue (0.90159073 0.09840927)  
##       10) english_marks>=0.228 3671 327 continue (0.91092345 0.08907655) *
##       11) english_marks< 0.228 38   0 drop (0.00000000 1.00000000) *
##    3) english_marks>=0.9975 36   0 drop (0.00000000 1.00000000) *
# Record the type of the model for later use.

mtype <- "rpart" 

We can also draw the model.

fancyRpartPlot(m.rp)

Step 6.2: Evaluate - Decision Tree Model

As we have noted though, performing any evaluation on the training dataset provides a biased estimate of the actual performance. We must instead evaluate the performance of our models on a previously unseen dataset (at least unseen by the algorithm building the model).

So we now evaluate the model performance on the testing dataset.

# Score model

predictions <- predict(m.rp, ds[test, vars], type="prob")
threshold <- 0.5
rpart_probability <- predictions[, 2]
rpart_prediction <- ifelse(rpart_probability > threshold, 1, 0)
pred <- cbind(ds[test, vars], rpart_prediction, rpart_probability)
head(pred)
##   continue_drop gender caste mathematics_marks english_marks science_marks
## 1      continue      f    bc             0.290         0.512         0.290
## 2      continue      f    oc             0.602         0.666         0.602
## 3      continue      f    bc             0.594         0.519         0.594
## 4      continue      f    bc             0.461         0.524         0.461
## 5      continue      f    oc             0.742         0.672         0.742
## 6          drop      f    bc             0.503         0.523         0.503
##   science_teacher languages_teacher guardian internet total_students
## 1               4                 7   mother     true            356
## 2               4                 2   mother    false            179
## 3               4                 8   mother     true            335
## 4               0                 3   mother     true            469
## 5               3                12   mother     true            132
## 6               9                 0   father     true            397
##   total_toilets establishment_year rpart_prediction rpart_probability
## 1            14               1943                0        0.02535065
## 2             8               1955                0        0.02535065
## 3            43               1916                0        0.02535065
## 4            14               1905                0        0.02535065
## 5            14               1996                0        0.02535065
## 6             5               1950                0        0.08907655
# Evaluate model

pred$continue_drop <- as.numeric(pred$continue_drop)-1

metrics.rp <- evaluateModel(data=pred,
                            observed="continue_drop",
                            predicted="rpart_prediction")
##         Predicted
## Observed    0    1
##        0 5475    0
##        1  229   26
metrics.rp
##  Accuracy Precision    Recall   F-Score 
## 0.9600349 1.0000000 0.1019608 0.1850534
rocChart(pr=pred$rpart_probability, target=pred$continue_drop)
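
As a quick sanity check, the reported metrics can be reproduced by hand from the confusion matrix above (tp=26, fn=229, fp=0, tn=5475):

```r
tp <- 26; fn <- 229; fp <- 0; tn <- 5475

accuracy  <- (tp + tn) / (tp + fn + fp + tn)   # 5501/5730
precision <- tp / (tp + fp)                    # 26/26
recall    <- tp / (tp + fn)                    # 26/255
fscore    <- 2 * precision * recall / (precision + recall)

round(c(accuracy, precision, recall, fscore), 7)
# 0.9600349 1.0000000 0.1019608 0.1850534
```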

Step 6.3: Compare - Multiple Models using Experiment

We can repeat the modelling multiple times, randomly selecting different datasets for training, to get an estimate of the actual expected performance and the variation we see in that performance. The helper function experi() can assist us here. It is available from http://onepager.togaware.com/experi.R and we show some of its code below.

# Show the function experi()

experi <- function(form, ds, dsname, target, modeller, details="",
                   n=100, control=NULL,
                   keep=FALSE, # Keep the last model built.
                   prob="prob",   
                   class="class",
                   log="experi.log")
{
  suppressPackageStartupMessages(require(pROC))

  user <- Sys.getenv("LOGNAME")
  node <- Sys.info()[["nodename"]]

  wsrpart.model <- modeller=="wsrpart"

  numclass <- length(levels(ds[,target]))

  start.time <- proc.time()

  seeds <- cors <- strs <- aucs <- accs <- NULL
  for (i in seq_len(n))
  {
    loop.time <- proc.time()

    seeds <- c(seeds, seed <- sample(1:1000000, 1))
    set.seed(seed)

....

    result[-c(1:7)] <- round(result[-c(1:7)], 2)
    row.names(result) <- NULL
    if (keep)
    {
      if (numclass==2)
      {
        attr(result, "pr") <- pr
        attr(result, "test") <- test
      }
      attr(result, "model") <- model
    }
  }
  return(result)
}

Let’s run the experiments using the algorithms rpart (Therneau and Atkinson, 2014), randomForest (Breiman et al., 2012), ada (Culp et al., 2012), and ctree() from party (Hothorn et al., 2013). In this way we can conveniently build these models and compare their performance.

# # Source experi.R 
# 
# source("http://onepager.togaware.com/experi.R")
# 
# # Set the times of loops
# 
# n <- 10
# 
# # Run experiments
# 
# ex.rp <- experi(form, ds[vars], dsname, target, "rpart", "1", n=n, keep=TRUE)
# ex.rf <- experi(form, ds[vars], dsname, target, "randomForest", "500", n=n, keep=TRUE, control=list(na.action=na.omit))
# ex.ad <- experi(form, ds[vars], dsname, target, "ada", "50", n=n, keep=TRUE)
# ex.ct <- experi(form, ds[vars], dsname, target, "ctree", "1", n=n, keep=TRUE)
# 
# # Compare results
# 
# results <- rbind(ex.rp, ex.rf, ex.ad, ex.ct)
# rownames(results) <- results$modeller
# results$modeller <- NULL
# results

Step 7.1: Other Models - Support Vector Machine Model

Beyond the commonly used binary classification models above, we can also try some more advanced models, for instance svm() support vector machines, nnet() neural networks, and xgboost() extreme gradient boosting. We first build an svm() support vector machine model.

# Tune hyper-parameters

system.time({
  m.svm.cv <- tune.svm(form,
                       data=ds[train, vars],
                       gamma=2^(-1:1),
                       cost=2^(2:4),
                       type="C-classification",
                       probability=TRUE,
                       scale=FALSE)
})
##     user   system  elapsed 
## 12094.02    18.05 12115.06
print(m.svm.cv$best.performance)
## [1] 0.034181
# Train model: svm

system.time({
  m.svm <- svm(form, 
               data=ds[train, vars], 
               gamma=as.numeric(m.svm.cv$best.parameters[1]), 
               cost=as.numeric(m.svm.cv$best.parameters[2]),
               type="C-classification",
               probability = TRUE,
               scale = FALSE)
})
##    user  system elapsed 
##  180.69    0.17  180.88
# Check the model information

m.svm
## 
## Call:
## svm(formula = form, data = ds[train, vars], gamma = as.numeric(m.svm.cv$best.parameters[1]), 
##     cost = as.numeric(m.svm.cv$best.parameters[2]), type = "C-classification", 
##     probability = TRUE, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  4 
##       gamma:  2 
## 
## Number of Support Vectors:  10575

Then we score the model on the testing dataset and evaluate its performance.

# Score model 

predictions <- predict(m.svm, ds[test, vars], probability=TRUE)
threshold <- 0.5
svm_probability <- attr(predictions, 'probabilities')[, 2]
svm_prediction <- ifelse(svm_probability > threshold, 1, 0)
pred <- cbind(ds[test, vars], svm_prediction, svm_probability)
head(pred)
##   continue_drop gender caste mathematics_marks english_marks science_marks
## 1      continue      f    bc             0.290         0.512         0.290
## 2      continue      f    oc             0.602         0.666         0.602
## 3      continue      f    bc             0.594         0.519         0.594
## 4      continue      f    bc             0.461         0.524         0.461
## 5      continue      f    oc             0.742         0.672         0.742
## 6          drop      f    bc             0.503         0.523         0.503
##   science_teacher languages_teacher guardian internet total_students
## 1               4                 7   mother     true            356
## 2               4                 2   mother    false            179
## 3               4                 8   mother     true            335
## 4               0                 3   mother     true            469
## 5               3                12   mother     true            132
## 6               9                 0   father     true            397
##   total_toilets establishment_year svm_prediction svm_probability
## 1            14               1943              0      0.03416375
## 2             8               1955              0      0.03744888
## 3            43               1916              0      0.03462933
## 4            14               1905              0      0.02757413
## 5            14               1996              0      0.03769413
## 6             5               1950              1      0.94207574
# Evaluate model

pred$continue_drop <- as.numeric(pred$continue_drop)-1

metrics.svm <- evaluateModel(data=pred,
                              observed="continue_drop",
                              predicted="svm_prediction")
##         Predicted
## Observed    0    1
##        0 5461   14
##        1  174   81
metrics.svm
##  Accuracy Precision    Recall   F-Score 
## 0.9671902 0.8526316 0.3176471 0.4628571
rocChart(pr=pred$svm_probability, target=pred$continue_drop)

Step 7.2: Other Models - Neural Network Model

Next we build an nnet() neural network model.

# Tune hyper-parameters

system.time({
  m.nnet.cv <- tune.nnet(form,
                         data=ds[train, vars],
                         size=c(2, 4, 6, 8, 10),
                         decay=5*10^(-5:-1),
                         rang=0.1,
                         maxit=200)
})
##    user  system elapsed 
## 3846.23    3.24 3849.90
print(m.nnet.cv$best.performance)
## [1] 0.02969334
# Train model: nnet

system.time({
  m.nnet <- nnet(formula=form,
                 data=ds[train, vars],
                 size=as.numeric(m.nnet.cv$best.parameters[1]), 
                 decay=as.numeric(m.nnet.cv$best.parameters[2]), 
                 rang=0.1,
                 maxit=200)
})
## # weights:  109
## initial  value 8795.635722 
## iter  10 value 2584.626927
## iter  20 value 2584.251558
## iter  30 value 2579.904241
## iter  40 value 2551.472270
## iter  50 value 2438.169621
## iter  60 value 2245.353974
## iter  70 value 2181.523202
## iter  80 value 2079.203645
## iter  90 value 1844.057296
## iter 100 value 1736.034111
## iter 110 value 1722.882974
## iter 120 value 1637.740645
## iter 130 value 1447.625951
## iter 140 value 1268.792633
## iter 150 value 1003.792683
## iter 160 value 822.964306
## iter 170 value 720.855650
## iter 180 value 677.623554
## iter 190 value 668.045871
## iter 200 value 636.599741
## final  value 636.599741 
## stopped after 200 iterations
##    user  system elapsed 
##    7.52    0.00    7.52
# Check the model information

m.nnet
## a 16-6-1 network with 109 weights
## inputs: genderm casteoc castesc castest mathematics_marks english_marks science_marks science_teacher languages_teacher guardianmixed guardianmother guardianother internettrue total_students total_toilets establishment_year 
## output(s): continue_drop 
## options were - entropy fitting  decay=0.05
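
The reported weight count can be verified from the architecture: a fully connected 16-6-1 network has one bias per hidden and output unit, so:

```r
inputs <- 16; hidden <- 6; outputs <- 1

n_weights <- (inputs + 1) * hidden +  # input-to-hidden weights plus hidden biases
             (hidden + 1) * outputs   # hidden-to-output weights plus output bias
n_weights
# 109, as reported by nnet above
```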

Then we score the model on the testing dataset and evaluate its performance.

# Score model

predictions <- predict(m.nnet, ds[test, vars], type="raw")
threshold <- 0.5
nnet_probability <- predictions
nnet_prediction <- ifelse(nnet_probability > threshold, 1, 0)
pred <- cbind(ds[test, vars], nnet_prediction, nnet_probability)
head(pred)
##   continue_drop gender caste mathematics_marks english_marks science_marks
## 1      continue      f    bc             0.290         0.512         0.290
## 2      continue      f    oc             0.602         0.666         0.602
## 3      continue      f    bc             0.594         0.519         0.594
## 4      continue      f    bc             0.461         0.524         0.461
## 5      continue      f    oc             0.742         0.672         0.742
## 6          drop      f    bc             0.503         0.523         0.503
##   science_teacher languages_teacher guardian internet total_students
## 1               4                 7   mother     true            356
## 2               4                 2   mother    false            179
## 3               4                 8   mother     true            335
## 4               0                 3   mother     true            469
## 5               3                12   mother     true            132
## 6               9                 0   father     true            397
##   total_toilets establishment_year nnet_prediction nnet_probability
## 1            14               1943               0     2.230320e-02
## 2             8               1955               0     2.911806e-05
## 3            43               1916               0     2.894795e-05
## 4            14               1905               0     1.020412e-05
## 5            14               1996               0     1.122212e-05
## 6             5               1950               1     9.772464e-01
# Evaluate model

pred$continue_drop <- as.numeric(pred$continue_drop)-1

metrics.nnet <- evaluateModel(data=pred,
                              observed="continue_drop",
                              predicted="nnet_prediction")
##         Predicted
## Observed    0    1
##        0 5474    1
##        1   31  224
metrics.nnet
##  Accuracy Precision    Recall   F-Score 
## 0.9944154 0.9955556 0.8784314 0.9333333
rocChart(pr=pred$nnet_probability, target=pred$continue_drop)

Step 7.3: Other Models - Extreme Gradient Boosting Model

Finally, we build an xgboost() extreme gradient boosting model as a special example, since it performs well on unbalanced data. In our case the proportion of student drop-outs is around 5% in the original training dataset, and we use that dataset directly as input to demonstrate the power of xgboost() on unbalanced data.

# Re-structure the training data set

traindata <- ds[train, inputs]

traindata[, c(1:ncol(traindata))] <- sapply(traindata[, c(1:ncol(traindata))], as.numeric) 
ntrain <- as.matrix(traindata[ , c(1:ncol(traindata))])

dtrain <- list()
dtrain$data <- Matrix(ntrain, sparse=TRUE)
dtrain$label <- as.numeric(as.data.frame(ds[train, target])[[1]]) - 1

dtrain %>% str()
## List of 2
##  $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:158037] 0 1 2 3 4 5 6 7 8 9 ...
##   .. ..@ p       : int [1:13] 0 13370 26740 40110 53480 66850 79090 91187 104557 117927 ...
##   .. ..@ Dim     : int [1:2] 13370 12
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:12] "gender" "caste" "mathematics_marks" "english_marks" ...
##   .. ..@ x       : num [1:158037] 1 1 1 1 1 1 1 2 2 2 ...
##   .. ..@ factors : list()
##  $ label: num [1:13370] 0 0 0 0 0 1 0 0 0 0 ...
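
Note that xgboost also offers a direct lever for unbalanced data: the scale_pos_weight parameter, commonly set to the ratio of negative to positive cases in the training labels. A hypothetical sketch of how it could be derived (this parameter is not used in the tuning below):

```r
# Toy label vector mirroring a roughly 5% minority class.
label <- c(rep(0, 95), rep(1, 5))

spw <- sum(label == 0) / sum(label == 1)
spw
# 19

# spw could then be supplied among the xgboost parameters, e.g.
# params <- list(objective="binary:logistic", scale_pos_weight=spw)
```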
# Tune hyper-parameters

cv.ctrl <- trainControl(method="cv",
                        number=5,
                        verboseIter=TRUE,
                        returnData=FALSE,
                        returnResamp="all",  # save losses across all models
                        classProbs=TRUE,     # set to TRUE for AUC to be computed
                        summaryFunction=twoClassSummary,
                        allowParallel=TRUE)

grid.xgb <- expand.grid(nrounds=2,
                        max_depth=2^(1:3),
                        eta=1*10^(-2:0),
                        min_child_weight=1,
                        colsample_bytree=1,
                        subsample=1,
                        gamma=0)

set.seed(45)
m.xgb.cv <- train(x=ntrain,
                  y=as.data.frame(ds[train, target])[[1]],
                  method="xgbTree",
                  trControl=cv.ctrl,
                  tuneGrid=grid.xgb,
                  verbose=TRUE,
                  metric="AUC",
                  nthread=2)
## + Fold1: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold1: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold1: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold1: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold1: eta=0.01, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold1: eta=0.01, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold1: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold1: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold1: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold1: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold1: eta=0.10, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold1: eta=0.10, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold1: eta=1.00, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold1: eta=1.00, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold1: eta=1.00, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold1: eta=1.00, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold1: eta=1.00, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold1: eta=1.00, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold2: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold2: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold2: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold2: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold2: eta=0.01, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold2: eta=0.01, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold2: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold2: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold2: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold2: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold2: eta=0.10, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold2: eta=0.10, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold2: eta=1.00, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold2: eta=1.00, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold2: eta=1.00, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold2: eta=1.00, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold2: eta=1.00, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold2: eta=1.00, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold3: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold3: eta=0.01, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold3: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold3: eta=0.01, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold3: eta=0.01, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold3: eta=0.01, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold3: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold3: eta=0.10, max_depth=2, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold3: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold3: eta=0.10, max_depth=4, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## + Fold3: eta=0.10, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## - Fold3: eta=0.10, max_depth=8, gamma=0, colsample_bytree=1, min_child_weight=1, subsample=1, nrounds=2 
## ... (analogous +/- lines repeat for the remaining eta/max_depth combinations of Folds 3, 4 and 5) ...
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 2, max_depth = 8, eta = 1, gamma = 0, colsample_bytree = 1, min_child_weight = 1, subsample = 1 on full training set
m.xgb.cv
## eXtreme Gradient Boosting 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10696, 10696, 10696, 10696, 10696 
## Resampling results across tuning parameters:
## 
##   eta   max_depth  ROC        Sens       Spec      
##   0.01  2          0.6868065  1.0000000  0.05581395
##   0.01  4          0.7739779  1.0000000  0.22015504
##   0.01  8          0.9785562  0.9971709  0.66046512
##   0.10  2          0.6868065  1.0000000  0.05581395
##   0.10  4          0.7767530  1.0000000  0.22015504
##   0.10  8          0.9871498  0.9995285  0.65581395
##   1.00  2          0.7065927  1.0000000  0.09457364
##   1.00  4          0.9152690  0.9993713  0.36899225
##   1.00  8          0.9997807  0.9975639  1.00000000
## 
## Tuning parameter 'nrounds' was held constant at a value of 2
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 1
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 2, max_depth = 8,
##  eta = 1, gamma = 0, colsample_bytree = 1, min_child_weight = 1 and
##  subsample = 1.
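For reference, the 3x3 grid of eta and max_depth candidates explored above can be reproduced with expand.grid(). The train() call itself belongs to an earlier chunk of this pattern, so the sketch below is a hedged reconstruction: the grid column names follow caret's xgbTree method, but the exact arguments of the original call are assumptions based on the output.

```r
# Hypothetical reconstruction of the tuning grid behind the output above.
# Only eta and max_depth vary; the remaining parameters are held constant.
grid <- expand.grid(nrounds=2,
                    max_depth=c(2, 4, 8),
                    eta=c(0.01, 0.10, 1.00),
                    gamma=0,
                    colsample_bytree=1,
                    min_child_weight=1,
                    subsample=1)

nrow(grid)   # 9 candidate models, matching the 9 rows of results above

# Sketch of the corresponding caret call (assumed, requires caret):
# m.xgb.cv <- train(form, data=ds[train, vars], method="xgbTree",
#                   metric="ROC", tuneGrid=grid,
#                   trControl=trainControl(method="cv", number=5,
#                                          classProbs=TRUE,
#                                          summaryFunction=twoClassSummary))
```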
# Train model: xgboost, using the best hyper-parameters found above.

system.time({
  m.xgb <- xgboost(data=dtrain$data, 
                   label=dtrain$label,
                   nround=m.xgb.cv$bestTune$nrounds, 
                   max.depth=m.xgb.cv$bestTune$max_depth, 
                   eta=m.xgb.cv$bestTune$eta, 
                   min_child_weight=1,
                   colsample_bytree=1,
                   subsample=1,
                   gamma=0,
                   nthread=2, 
                   objective="binary:logistic")
})
## [1]  train-error:0.021242 
## [2]  train-error:0.000000
##    user  system elapsed 
##    0.06    0.00    0.04
m.xgb
## ##### xgb.Booster
## raw: 4.8 Kb 
## call:
##   xgb.train(params = params, data = dtrain, nrounds = nrounds, 
##     watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
##     early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
##     save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
##     callbacks = callbacks, max.depth = ..1, eta = ..2, min_child_weight = 1, 
##     colsample_bytree = 1, subsample = 1, gamma = 0, nthread = 2, 
##     objective = "binary:logistic")
## params (as set within xgb.train):
##   max_depth = "8", eta = "1", min_child_weight = "1", colsample_bytree = "1", subsample = "1", gamma = "0", nthread = "2", objective = "binary:logistic", silent = "1"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
##   cb.evaluation.log()
##   cb.save.model(save_period = save_period, save_name = save_name)
## niter: 2
## evaluation_log:
##  iter train_error
##     1    0.021242
##     2    0.000000
# Calculate feature importance

importance <- xgb.importance(feature_names=dtrain$data@Dimnames[[2]], 
                             model=m.xgb)
print(importance)
##              Feature       Gain       Cover  Frequency
## 1: mathematics_marks 0.32118756 0.143316075 0.22807018
## 2: languages_teacher 0.21710591 0.161043921 0.21052632
## 3:     english_marks 0.19545345 0.300644139 0.17543860
## 4:          guardian 0.07497293 0.107159411 0.07017544
## 5:   science_teacher 0.07398745 0.209856299 0.08771930
## 6:             caste 0.06018546 0.069855688 0.12280702
## 7:            gender 0.04196116 0.001953313 0.07017544
## 8:          internet 0.01514608 0.006171154 0.03508772
# Visualize feature importance

xgb.plot.importance(importance)

# Plot a boosted tree model

xgb.plot.tree(dtrain$data@Dimnames[[2]], model=m.xgb)

Now we score the model on the testing dataset and evaluate its performance.

# Re-structure the testing data set

testdata <- ds[test, inputs]

testdata[, 1:ncol(testdata)] <- sapply(testdata[, 1:ncol(testdata)], as.numeric) 
ntest <- as.matrix(testdata)

dtest <- list()
dtest$data <- Matrix(ntest, sparse=TRUE)
dtest$label <- as.numeric(as.data.frame(ds[test, target])[[1]]) - 1

dtest %>% str()
## List of 2
##  $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:67713] 0 1 2 3 4 5 6 7 8 9 ...
##   .. ..@ p       : int [1:13] 0 5730 11460 17190 22920 28650 33860 39063 44793 50523 ...
##   .. ..@ Dim     : int [1:2] 5730 12
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:12] "gender" "caste" "mathematics_marks" "english_marks" ...
##   .. ..@ x       : num [1:67713] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..@ factors : list()
##  $ label: num [1:5730] 0 0 0 0 0 1 0 0 0 0 ...
# Score model

predictions <- predict(m.xgb, dtest$data)
threshold <- 0.5
xgboost_probability <- predictions
xgboost_prediction <- ifelse(xgboost_probability > threshold, 1, 0)
pred <- cbind(testdata, dtest$label, xgboost_prediction, xgboost_probability)
names(pred)[names(pred) == "dtest$label"] <- target
head(pred)
##   gender caste mathematics_marks english_marks science_marks
## 1      1     1             0.290         0.512         0.290
## 2      1     2             0.602         0.666         0.602
## 3      1     1             0.594         0.519         0.594
## 4      1     1             0.461         0.524         0.461
## 5      1     2             0.742         0.672         0.742
## 6      1     1             0.503         0.523         0.503
##   science_teacher languages_teacher guardian internet total_students
## 1               4                 7        3        2            356
## 2               4                 2        3        1            179
## 3               4                 8        3        2            335
## 4               0                 3        3        2            469
## 5               3                12        3        2            132
## 6               9                 0        1        2            397
##   total_toilets establishment_year continue_drop xgboost_prediction
## 1            14               1943             0                  0
## 2             8               1955             0                  0
## 3            43               1916             0                  0
## 4            14               1905             0                  0
## 5            14               1996             0                  0
## 6             5               1950             1                  1
##   xgboost_probability
## 1          0.04225920
## 2          0.04254206
## 3          0.04225920
## 4          0.04254206
## 5          0.04225920
## 6          0.94488204
# Evaluate model

metrics.xgb <- evaluateModel(data=pred,
                             observed="continue_drop",
                             predicted="xgboost_prediction")
##         Predicted
## Observed    0    1
##        0 5475    0
##        1    0  255
metrics.xgb
##  Accuracy Precision    Recall   F-Score 
##         1         1         1         1
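The metrics reported by evaluateModel() (a support function defined earlier in the pattern) follow directly from the confusion matrix. A minimal base-R sketch of the same calculation, with the positive class coded 1 (the function name here is illustrative, not the pattern's own):

```r
# Compute Accuracy, Precision, Recall and F-Score from observed and
# predicted 0/1 vectors -- a sketch of what evaluateModel() reports.
confusionMetrics <- function(observed, predicted)
{
  tp <- sum(observed == 1 & predicted == 1)   # true positives
  tn <- sum(observed == 0 & predicted == 0)   # true negatives
  fp <- sum(observed == 0 & predicted == 1)   # false positives
  fn <- sum(observed == 1 & predicted == 0)   # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(Accuracy  = (tp + tn) / length(observed),
    Precision = precision,
    Recall    = recall,
    `F-Score` = 2 * precision * recall / (precision + recall))
}

# Small worked example: 2 true positives, 1 false negative.
confusionMetrics(observed=c(1, 1, 0, 0, 1), predicted=c(1, 0, 0, 0, 1))
```

On the test predictions above, all four metrics are 1 because the confusion matrix is perfectly diagonal.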
rocChart(pr=pred$xgboost_probability, target=pred$continue_drop)
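As a numeric cross-check of the ROC chart, the AUC can be computed directly. The pROC package loaded above provides auc(); a base-R alternative using the Mann-Whitney rank formulation is sketched here (the function name is illustrative):

```r
# AUC via the rank formulation: the probability that a randomly chosen
# positive case scores higher than a randomly chosen negative case.
aucRank <- function(probability, label)
{
  r   <- rank(probability)          # ranks of all scores, ties averaged
  pos <- sum(label == 1)
  neg <- sum(label == 0)
  (sum(r[label == 1]) - pos * (pos + 1) / 2) / (pos * neg)
}

# Applied to the scores above: aucRank(pred$xgboost_probability, pred$continue_drop)
aucRank(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))   # small worked example: 0.75
```

The pROC equivalent would be auc(pred$continue_drop, pred$xgboost_probability).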

Step 8: Finish Up - Save Model

We save the model, together with the dataset and other variables, into a binary R file. Here we use the xgboost model as an example.

model <- m.xgb
mtype <- 'xgboost'
pr <- xgboost_probability
cl <- xgboost_prediction

dname <- "models"
if (! file.exists(dname)) dir.create(dname)
time.stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
fstem <- paste(dsname, mtype, time.stamp, sep="_")
(fname <- file.path(dname, sprintf("%s.RData", fstem)))
## [1] "models/studentDropIndia_xgboost_20170810_123440.RData"
save(ds, dsname, vars, target, ignore,
     form, nobs, seed, train, test, model, mtype, pr, cl,
     file=fname)

We can then load this later and replicate the process.

(load(fname))
##  [1] "ds"     "dsname" "vars"   "target" "ignore" "form"   "nobs"  
##  [8] "seed"   "train"  "test"   "model"  "mtype"  "pr"     "cl"

Note that by using generic variable names we can load different model files and perform common operations on them without changing the names within a script. Be aware, however, that each time we load such a saved model file it overwrites any existing variables of the same name.
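The save/load round trip with generic names can be sketched in isolation. Here hypothetical stand-in values and a temporary file replace the real model objects, purely to illustrate the mechanics:

```r
# Demonstrate the save/load round trip with generic variable names,
# using stand-in values and a temporary file (hypothetical example).
model <- list(type="stand-in")   # placeholder for a fitted model like m.xgb
mtype <- "xgboost"

fname <- tempfile(fileext=".RData")
save(model, mtype, file=fname)

rm(model, mtype)                 # simulate starting a fresh session
(loaded <- load(fname))          # load() returns the names it restored

# A script can now refer to 'model' and 'mtype' regardless of which
# model file was loaded -- but any variables that already had those
# names have been silently overwritten.
```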